Here I am writing a short description of the course (or perhaps more like a few thoughts about it). I include a link to my GitHub repository here as well. This is an R Markdown (.Rmd) file, so I’ll be using R Markdown syntax.
The R chunk below prints out the date and time at the time of knitting this document, for no particular reason.
# This is a so-called "R chunk" where I can write my R code.
date()
## [1] "Thu Dec 1 11:57:42 2022"
Okay, so I’m feeling great, and on this course I expect to learn all kinds of interesting and useful things about GitHub, Markdown and reproducible research in general. I learned about the course from Sisu when looking for useful and interesting courses among those listed as suitable for my PhD degree (that is, there is a list of courses from which one needs to choose at least 10 study points in order to complete one’s degree).
I was already pretty familiar with the material covered in the ‘R for Health Data Science’ book and in ‘Exercise Set 1’, since I have a strong background in both statistics/data science and R. However, I’m motivated to learn as many new tools as possible to make my research as open and reproducible as possible. I also wish to streamline my research workflows through more fluent integration of coding, writing and publishing via GitHub, Markdown and RStudio.
Last but not least, here is a link to my GitHub repository used during the course: My GitHub Repo
students2014 <- read.csv("./data/learning2014.csv")
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Above we have loaded the dataset to be used for this week’s assignment and printed a short summary of the data. The data consists of:
2 variables of type ‘int’ (age, points),
4 variables of type ‘num’ (attitude, deep, stra, surf), and
1 variable of type ‘character’ (gender).
There are 166 observations per variable; the data are results from a survey that originally had 183 respondents. Respondents were course participants, and those with zero points from the course exam were removed from the data. The variables deep, stra and surf are aggregate variables constructed from multiple responses. More information on the dataset is available here.
Below are marginal distributions for every variable in the data, except for the variable gender, which is a binary variable for which 66% of respondents answered F and 34% answered M.
While marginal distributions are useful for assessing the shape of each variable’s distribution individually, we may use the pairs function to draw scatter plots summarizing the pairwise dependencies between the variables.
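As a minimal illustration of such a pairs plot (drawn here on R’s built-in iris data as a stand-in, since the point is only the mechanics of the call):

```r
# Sketch: all pairwise scatter plots with base R's pairs();
# iris is a built-in stand-in for the students2014 data frame
pairs(iris[, 1:4], pch = 20, col = as.numeric(iris$Species) + 1)
```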
Based on a quick visual assessment, there appears to be at least some positive correlation between points and attitude. Most of the other dependencies are, however, weaker and a little difficult to read from the pairwise scatter plots. To this end we may also inspect the correlation matrix of the data (below).
## age attitude deep stra surf points gender_binary
## age 1.00 0.02 0.03 0.10 -0.14 -0.09 -0.12
## attitude 0.02 1.00 0.11 0.06 -0.18 0.44 -0.29
## deep 0.03 0.11 1.00 0.10 -0.32 -0.01 -0.06
## stra 0.10 0.06 0.10 1.00 -0.16 0.15 0.15
## surf -0.14 -0.18 -0.32 -0.16 1.00 -0.14 0.11
## points -0.09 0.44 -0.01 0.15 -0.14 1.00 -0.09
## gender_binary -0.12 -0.29 -0.06 0.15 0.11 -0.09 1.00
Above we have recoded the variable gender into a binary variable that equals one if gender = F and zero otherwise. The correlation between points and attitude is the strongest linear dependency, as our visual assessment already suggested.
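The binary coding and the correlation matrix can be produced along the following lines; this is a sketch on a tiny toy data frame, not the actual students2014 data:

```r
# Sketch: recode gender into a 0/1 indicator (1 for "F") and inspect
# rounded correlations; the toy rows below are illustrative only
df <- data.frame(gender = c("F", "M", "F", "M"),
                 points = c(25, 12, 24, 10))
df$gender_binary <- ifelse(df$gender == "F", 1, 0)
round(cor(df[, c("points", "gender_binary")]), 2)
```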
We shall take the three variables with the largest absolute correlations with points as our starting point for the analysis. That is, we estimate a linear regression model with points as the dependent variable and three explanatory variables (attitude, stra and surf). The results of our regression model are summarized below (standard errors are not heteroskedasticity robust).
model1 <- lm(points ~ attitude + stra + surf, data = students2014)
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The t-statistics reported above suggest that the coefficients for stra and surf are not statistically significantly different from zero, as the low absolute correlations between these variables and points already suggested. To elaborate, the t-test tests the null hypothesis coefficient = 0 for every variable in the model, and only the coefficient on attitude was found to be statistically significant (and clearly so!).
To that end, we drop both stra and surf from our regression and run the final model with attitude as the only explanatory variable.
model2 <- lm(points ~ attitude, data = students2014)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The intercept of the final model is estimated to be 11.64, which can be interpreted as the exam points the model predicts, on average, for someone with attitude = 0. The more interesting coefficient, on attitude, is estimated to be 3.53, which suggests that, on average, every additional point in attitude is associated with a 3.53-point rise in exam points. The multiple R-squared of the model is around 19%, meaning the lone explanatory variable explains about that much of the variation in the dependent variable points.
Next, we shall take a look at some visual model diagnostics and discuss the assumptions of our model a little. Let’s see the visual diagnostics first.
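Diagnostics like these can be drawn with base R’s plot method for lm objects; a sketch, with a stand-in model fitted on the built-in mtcars data (model1 above would be used in the same way):

```r
# Sketch: standard lm diagnostics plus a residual histogram;
# 'fit' is a stand-in model on built-in data, not the document's model1
fit <- lm(mpg ~ wt, data = mtcars)
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 5))  # Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage
hist(resid(fit), main = "Histogram of residuals", xlab = "")
par(mfrow = c(1, 1))
```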
Above we have plotted residuals vs. fitted values, a normal Q-Q plot, residuals vs. leverage, points against attitude with our fitted regression line drawn over the scatter plot, and finally a histogram of the residuals. First, as both the normal Q-Q plot and the histogram of residuals suggest, the residuals of the model are slightly skewed: the left tail is thicker than normal, the right tail thinner. That is, the residuals do not look normally distributed. Moreover, there do not seem to be any wild outliers in the data, nor are the results dominated by particular observations, as suggested by the relatively low maximum leverage. The residuals also look fairly homoskedastic (although this is debatable, and heteroskedasticity-robust standard errors should probably be used just in case).
Fortunately, the linear regression model does not require us to assume normally distributed residuals. Technically, we only need to assume that the residuals have well-defined fourth moments (and, in this case, are homoskedastic); our coefficient estimates and their standard errors will then be asymptotically correct. We may also safely assume that residuals based on survey data are uncorrelated with each other. Finally, the dependence between points and attitude seems fairly linear, as required by our model assumptions.
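The heteroskedasticity-robust standard errors mentioned above can be computed with base R alone; a sketch of the HC1 (“sandwich”) estimator, again on a stand-in model rather than the document’s model2:

```r
# Sketch: HC1 heteroskedasticity-robust standard errors by hand;
# 'fit' is a stand-in model on built-in data
fit <- lm(mpg ~ wt, data = mtcars)
X <- model.matrix(fit)
u <- resid(fit)
n <- nrow(X); k <- ncol(X)
bread <- solve(crossprod(X))          # (X'X)^(-1)
meat  <- crossprod(X, X * u^2)        # X' diag(u^2) X
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread
sqrt(diag(vcov_hc1))                  # robust standard errors
```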
Thus, our model assumptions seem relatively well justified.
alc <- read.csv("./data/alc.csv")
str(alc)
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr "R" "R" "R" "R" ...
## $ famsize : chr "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr "T" "T" "T" "T" ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr "at_home" "other" "at_home" "services" ...
## $ Fjob : chr "other" "other" "other" "health" ...
## $ reason : chr "home" "reputation" "reputation" "course" ...
## $ guardian : chr "mother" "mother" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr "yes" "yes" "yes" "yes" ...
## $ famsup : chr "yes" "yes" "yes" "yes" ...
## $ activities: chr "yes" "no" "yes" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "yes" "yes" "no" "yes" ...
## $ romantic : chr "no" "yes" "no" "no" ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ failures : num 0.5 1 0 0 1 0 1 0 0 0 ...
## $ paid : chr "yes" "no" "yes" "yes" ...
## $ absences : num 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : num 10 10.5 14 10 11.5 11.5 11 9.5 15.5 10 ...
## $ G2 : num 11.5 8.5 13 10 11.5 12 5.5 9.5 15.5 10.5 ...
## $ G3 : num 11.5 8 12.5 9 11.5 11.5 6 9.5 15.5 10.5 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
Above we have printed some information about the data used in this week’s assignment, including variable names and the data types of the variables. In total we have 370 observations and 35 variables. The data are student-level results from a questionnaire in two Portuguese schools (secondary education), and the variables measure student achievement as well as demographic and social features. Based on this data, we shall study the relationship between alcohol consumption and some of the variables.
Of the variables not related to alcohol consumption, we focus our attention on higher, romantic, famrel and G3 (final grade). The working hypothesis is that the variables measuring social aspects (romantic and famrel) could be negatively correlated with alcohol consumption, and that the variables related to ambition and achievement (higher and G3) could likewise be negatively associated with it, since excessive alcohol use could make it more difficult to succeed in school.
For the analysis we need to encode romantic and higher (both take character values yes or no) into binary variables taking values one and zero.
alc$romantic_binary <- ifelse(alc$romantic == "yes", 1, 0)
alc$higher_binary <- ifelse(alc$higher == "yes", 1, 0)
Below we see the marginal distributions of the four explanatory variables for both high_use = TRUE and high_use = FALSE. Variable high_use takes values TRUE if the weekly alcohol consumption exceeds a certain threshold. For 30% of the students in the sample high_use = TRUE.
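An indicator like high_use can be derived from an average-consumption column along these lines; note that the cutoff of 2 below is purely illustrative, as the exact threshold is not stated here:

```r
# Sketch: flag high alcohol use when average consumption exceeds a cutoff;
# the value 2 is an assumption for illustration only
toy <- data.frame(alc_use = c(1, 3, 1, 2.5))
toy$high_use <- toy$alc_use > 2
toy$high_use
```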
The conditional marginal distributions of the variables provide preliminary support for all of our hypotheses. First, higher is a binary variable taking the value zero only if the student is not considering pursuing higher education, and thus reflects educational ambitions. As suspected, among those with lower educational ambitions, high alcohol consumption is much more common. The difference might not look huge in the plot, since higher = 0 for only 4% of the students, but among those (16) students high alcohol consumption was approximately twice as probable as in the rest of the data! (see table below)
##
##                  higher = 0 higher = 1
## high_use = TRUE           9        102
## high_use = FALSE          7        252
As for the next variable, whether a student is in a romantic relationship does not seem to have much bearing on their alcohol consumption. However, high_use = TRUE seems slightly more common when romantic = “no”, as hypothesized. Good family relations (higher values of famrel), on the other hand, seem clearly associated with lower alcohol consumption, also as hypothesized. Finally, perhaps the clearest negative association (by visual inspection at least) is between high alcohol consumption and the final grade, i.e. educational success, which makes perfect sense and is also as we suspected.
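Group-wise comparisons like those just described can also be summarized numerically, e.g. with aggregate(); a sketch on toy rows standing in for the alc data:

```r
# Sketch: mean final grade by drinking status (toy data, illustrative only)
toy <- data.frame(high_use = c(TRUE, TRUE, FALSE, FALSE),
                  G3 = c(9, 10, 13, 14))
aggregate(G3 ~ high_use, data = toy, FUN = mean)
```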
Next, we shall estimate a logistic regression model using the four variables just discussed as explanatory variables and high_use as the dependent variable. Below we have printed out the summary of our results.
# Logistic regression model
model <- glm(high_use ~ higher_binary + romantic_binary + famrel + G3,
data = alc, family = binomial(link = "logit"))
# A summary of the estimated model
summary(model)
##
## Call:
## glm(formula = high_use ~ higher_binary + romantic_binary + famrel +
## G3, family = binomial(link = "logit"), data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4342 -0.8506 -0.7251 1.3382 1.9521
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.06805 0.77766 2.659 0.00783 **
## higher_binary -0.90841 0.55444 -1.638 0.10133
## romantic_binary -0.33272 0.25738 -1.293 0.19611
## famrel -0.28414 0.12473 -2.278 0.02272 *
## G3 -0.07423 0.03716 -1.997 0.04578 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 436.57 on 365 degrees of freedom
## AIC: 446.57
##
## Number of Fisher Scoring iterations: 4
# Odds ratios
odds_ratios <- exp(model$coefficients)
# Confidence intervals
ci <- confint(model)
## Waiting for profiling to be done...
# Printing out both odds ratios and their confidence intervals
print(cbind(odds_ratios, exp(ci)))
## odds_ratios 2.5 % 97.5 %
## (Intercept) 7.9093629 1.7560948 37.6713039
## higher_binary 0.4031642 0.1315342 1.1948265
## romantic_binary 0.7169700 0.4283564 1.1778900
## famrel 0.7526588 0.5885479 0.9614471
## G3 0.9284590 0.8625985 0.9983319
The estimated coefficients of the logistic regression model are all negative (equivalently, the point estimates of the odds ratios are all below one), and thus consistent with our hypothesis that all the included explanatory variables are negatively associated with high alcohol consumption. The results are, however, not very strong. The negative association between alcohol consumption and the first two explanatory variables (romantic and higher) is not statistically significant at any commonly used confidence level. This comes as no surprise after inspecting the conditional marginal distributions of those variables, where the difference between high_use = TRUE and high_use = FALSE was quite small. Also, as discussed, there were only 16 students for whom higher_binary = 0. Hence, although the point estimate for the coefficient of higher was actually the largest in absolute value, more data would be needed to obtain statistically significant results.
On the other hand, the coefficients on famrel and G3 were found to be statistically significant at the 95% level, albeit small-ish in absolute value. Based on these findings, we could say the data and the model give some support to our hypothesis that both healthy social relations and educational ambitions are negatively associated with high alcohol consumption.
However, as discussed, the odds ratios of the statistically significant explanatory variables were not very different from one (no predictive power), and the 95% confidence intervals of both almost include one. To make things worse, the point estimates of the non-significant coefficients were the largest in absolute value, so it is quite plausible that some over-fitting is going on. The predictive performance of the model (based on those point estimates) might then not be all that amazing. Let us assess this by first printing out a 2x2 cross tabulation of predictions versus the actual observed values (see below). As a decision boundary we use the value of 50%.
## prediction
## high_use 0 1
## 0 249 10
## 1 105 6
Clearly, the model inaccurately classifies almost all of the students as high_use = 0. This is no surprise, since for 70% of the students this is indeed the case; thus, by classifying every student like this, one would already obtain a respectable (not really) accuracy rate of 70%.
So does our model perform any better than such a simple guessing strategy? Unfortunately, it turns out it does not… The accuracy rate of our model is only 68.92% (31.08% training error), against the 70% (30% training error) of the simple guessing strategy.
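The accuracy and training-error figures above are simple proportions; a sketch of the mechanics, with toy vectors in place of the model’s fitted probabilities:

```r
# Sketch: accuracy and training error under a 50% decision boundary
probs  <- c(0.20, 0.60, 0.40, 0.80, 0.10)  # toy fitted probabilities
actual <- c(FALSE, TRUE, FALSE, FALSE, FALSE)
pred   <- probs > 0.5
accuracy       <- mean(pred == actual)
training_error <- 1 - accuracy
c(accuracy = accuracy, training_error = training_error)
```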
To conclude, the predictive performance of our model is poor, reflecting the weakness of the obtained results. However, the data and the model do give some support to our hypothesis that both healthy social relations and educational ambitions are negatively associated with high alcohol consumption.
library(MASS)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Above we have loaded a dataset called Boston from the MASS package. The dataset consists of crime rates in various towns in the Boston area, along with various other town characteristics. In total there are 506 observations (towns) and 14 variables.
Below are the marginal distributions as well as pair plots of all the variables in the data.
par(mfrow = c(3, 5))
par(mar = c(3, 4, 2, 1))
for(i in 1:ncol(Boston)) hist(Boston[,i], main = colnames(Boston)[i], xlab = "", col = "peachpuff")
par(mfrow = c(1, 1))
Clearly there are a lot of irregularities in the distribution of the data. Specifically, many of the marginal distributions are seriously skewed and even multimodal. Hence, no methods assuming normality or near normality of the data should be employed here. Let us next draw a correlation plot to assess the linear dependencies between the variables.
library(corrplot)
## corrplot 0.92 loaded
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method = "circle", type = "upper")
There are multiple strong correlations, both positive and negative, between many of the variables, suggesting the variables do indeed have some explanatory power over each other. Below we print out the five largest positive and negative correlations in the data.
print_largest_correlations <- function(data, how_many = 5, negative = FALSE, digits = 2) {
  cm <- cor(data); diag(cm) <- NA; n <- ncol(cm)
  if(how_many > n * (n - 1) / 2) how_many <- n * (n - 1) / 2
  cm_order <- order(cm[upper.tri(cm)])
  largest <- ifelse(rep(negative, how_many), cm_order[1:how_many], rev(cm_order)[1:how_many])
  strings <- c(); cors <- c()
  count <- 1
  for(j in 1:n) {
    for(i in 1:n) {
      if(i >= j) next
      if(count %in% largest) {
        cors <- c(cors, cor(data[,i], data[,j]))
        strings <- c(strings, paste0("cor(", colnames(data)[i], ", ", colnames(data)[j], ") = ",
                                     round(cors[length(cors)], digits), "\n"))
      }
      count <- count + 1
    }
  }
  strings <- strings[order(cors, decreasing = !negative)]
  if(!negative) cat("Largest positive correlations: \n") else cat("Largest negative correlations: \n")
  for(i in 1:length(strings)) cat(strings[i])
  cat("\n")
}
print_largest_correlations(Boston)
## Largest positive correlations:
## cor(rad, tax) = 0.91
## cor(indus, nox) = 0.76
## cor(nox, age) = 0.73
## cor(indus, tax) = 0.72
## cor(rm, medv) = 0.7
print_largest_correlations(Boston, negative = TRUE)
## Largest negative correlations:
## cor(nox, dis) = -0.77
## cor(age, dis) = -0.75
## cor(lstat, medv) = -0.74
## cor(indus, dis) = -0.71
## cor(rm, lstat) = -0.61
So we have established that there are some linear dependencies between the variables in the data, but what about non-linear dependencies? We may try to assess these visually by means of a pair plot (plotted below).
pairs(Boston, gap = 0.5, oma = c(0, 0, 0, 0), pch = 20)
Visual assessment doesn’t immediately reveal any strong dependencies beyond those linear enough to produce correlations with high absolute values. However, some dependencies for which the correlations were high seem to exhibit significant non-linearities (see e.g. the relationship between nox and dis), so naive linear models might not be the best approach for modeling the dependencies in this data.
Before diving into model fitting and further analysis, we shall prepare our data. First we scale the data to have zero mean and unit variance.
boston_scaled <- data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We shall also replace the numerical variable crim (crime rates) with a categorical counterpart with four levels (“low”, “med_low”, “med_high”, “high”), using the quartiles as break points.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- crime
We then divide the data into training and test sets with a random 80/20 split.
# Random 80/20 split:
# (floor(n * 0.8) returns an integer -> less ambiguous than n * 0.8, also
# 'sample.int' slightly more efficient and less ambiguous than 'sample')
n <- nrow(boston_scaled)
ind <- sample.int(n, size = floor(n * 0.8))
train <- data.frame(boston_scaled[ind,], stringsAsFactors = TRUE)
test <- data.frame(boston_scaled[-ind,], stringsAsFactors = TRUE)
# Save the correct classes from test data (just in case)...
correct_classes <- test$crim
# ...after which remove the crime variable from test data:
test$crim <- NULL
Then, although we have established the non-normality of the variables, we shall use a method that assumes the data to follow a multivariate normal distribution, because why not! Linear discriminant analysis (LDA) it is.
# LDA model
lda.fit <- MASS::lda(crim ~ ., data = train)
# The function for LDA biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# Target classes as numeric (note: the grouping column in 'train' is 'crim')
classes <- as.numeric(train$crim)
# Plot the LDA biplot, coloring points by class
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 10 0 0
## med_low 5 18 2 0
## med_high 0 6 18 2
## high 0 0 1 28
Above we have plotted our results as an LDA biplot and then printed out the cross tabulation of the predictions made by the model on the test data. Evidently, despite the violation of the LDA assumptions, LDA seems able to differentiate between the classes, and the test error rate seems fairly low. The model has the most difficulty in differentiating between the categories “med_low” and “med_high”, which is to be expected. None of the “low” (“high”) observations is, however, classified as “med_high” or “high” (“med_low” or “low”), which I’d say is fairly well done.
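The test-set accuracy behind that statement can be read straight off the confusion matrix; the counts below are copied from the table printed above:

```r
# Accuracy from the LDA confusion matrix (rows = correct, cols = predicted)
conf <- matrix(c(12, 10,  0,  0,
                  5, 18,  2,  0,
                  0,  6, 18,  2,
                  0,  0,  1, 28),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("low", "med_low", "med_high", "high"),
                               c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # share of correctly classified test observations
```

This gives roughly 75% of the test observations classified correctly.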
However, perhaps there are other methods that could perform even better? To that end, let us compare the results attained with LDA to those attained with the k-means algorithm. For one, k-means does not assume the variables to follow a multivariate normal distribution, so perhaps it performs better than LDA? Let us first reload and scale the data, take a quick look at the distances between observations, and then run the k-means algorithm, first with four clusters.
# Reload the original data
library(MASS)
data("Boston")
# Scale the data
boston_scaled <- data.frame(scale(Boston))
# Calculate the (Euclidean) distances between observations
# (note: computed on the unscaled Boston data)
distances <- dist(Boston)
summary(distances)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
# Set random seed
set.seed(42)
# Run the k-means algorithm
km <- kmeans(Boston, centers = 4)
# Visualize the results with a pairplot
pairs(Boston, col = km$cluster, gap = 0.5, oma = c(0, 0, 0, 0), pch = 20)
For some variable pairs the four clusters seem to do a great job of differentiating the data (as depicted by the pairplot above), whereas for others the clusters are less clear. To that end, let us assess how the total within-cluster sum of squares (TWCSS) behaves as a function of the number of clusters.
# Set random seed
set.seed(42)
# Set the max number of clusters
k_max <- 10
# Calculate the total within sum of squares for every number of clusters
TWCSS <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# Visualize the results
plot(x = 1:k_max, y = TWCSS, type = "l", xlab = "Number of Clusters", lwd = 2)
grid()
The most drastic drop in TWCSS happens when going from one cluster to two. After that, the decrease in TWCSS is still significant up to four clusters, after which any gains are more or less marginal. The four clusters used originally might then be pretty close to the optimal number.
Although we didn’t really get to compare LDA and k-means directly in the end, both approaches seemed to fit the data quite well, and interesting insights into the data could most probably be drawn with either method.
human <- readRDS("data/human.rds")
# Pairplot
pairs(human, gap = 0.5, oma = c(0, 0, 0, 0), pch = 20, xaxt = "n", yaxt = "n")
#Correlation plot
corrplot(cor(human), type = "upper")
# Histograms of marginal distributions
par(mfrow = c(3, 3))
par(mar = c(2, 3.5, 3, 1))
for(i in 1:ncol(human)) {
hist(human[,i], col = "white", main = colnames(human)[i], xlab = "", ylab = "")
}
par(mfrow = c(1, 1))
This week we will be working with the ‘human’ dataset, originating from the United Nations Development Programme. Above we have produced a visual summary of our data, including a pairplot, a correlation plot and histograms of the marginal distributions of the variables. Evidently, there are plenty of strong dependencies between the variables, some less linear than others. The linear dependencies are well captured by the correlation plot, whereas the less linear ones are made clearer by the pairplot. For instance, according to the pairplot, perhaps the strongest pairwise dependency is between GNI and Mat.Mor. While they are indeed negatively correlated, the correlation plot alone would give a false impression of only a relatively modest dependency between the variables.
As all the variables have positive support, most of the marginal distributions are also relatively skewed, as expected. Only the marginal distribution of Edu.Exp could perhaps be approximated by a normal distribution with acceptable accuracy. Non-linear dependencies and non-normal marginal distributions are indeed often related phenomena.
We shall begin our analysis of the data by performing a principal component analysis (PCA) with unnormalized data (not very smart, as we will see!).
# Principal component analysis (PCA)
pca_human <- prcomp(human)
s <- summary(pca_human)
# Rounded percentages of variance captured by each component
pca_pr <- round(100*s$importance[2, ], digits = 0)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# Draw a biplot
biplot(pca_human, cex = c(0.6, 0.8), col = c("grey40", "tomato"),
xlab = pc_lab[1], ylab = pc_lab[2])
Evidently, the unnormalized data does not allow for meaningful analysis, as the first principal component explains 99.99% of the variance in the data, which is also illustrated by the uninformative biplot above. To this end we shall normalize the data (to zero mean and unit variance for all variables) and repeat the principal component analysis.
# Normalize the data
human_std <- scale(human)
# Principal component analysis (PCA)
pca_human <- prcomp(human_std)
s <- summary(pca_human)
# The variance in the data captured as a function of the number of components
plot(1:ncol(s$imp), s$imp[3,] * 100, ylim = c(0, 100), ylab = "Variance Explained, %",
xlab = "Number of Principal Components", type = "l", lwd = 2)
grid()
# Rounded percentages of variance captured by each component
pca_pr <- round(100*s$importance[2,], digits = 0)
pc_lab1 <- paste0(names(pca_pr)[1],
c(" - Quality of Life"),
" (", pca_pr[1], "%)")
pc_lab2 <- paste0(names(pca_pr)[2],
c(" - Gender Equality"),
" (", pca_pr[2], "%)")
# Draw a biplot
biplot(pca_human, cex = c(0.6, 0.8), col = c("grey40", "tomato"),
xlab = pc_lab1, ylab = pc_lab2, xlim = c(-0.2, 0.2))
After normalizing the data, our PCA looks much better! This is to be expected: without normalization, the contribution of each variable to the analysis is dominated by its scale. For instance, in our unnormalized data the variance of GNI was orders of magnitude larger than that of the other variables, rendering all other variables irrelevant.
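The scale-dominance effect described here is easy to demonstrate on synthetic data: one large-variance column swallows essentially all of PC1 unless the data are scaled.

```r
# Sketch: PC1's share of variance with and without scaling, on toy data
set.seed(1)
x <- data.frame(small = rnorm(100), big = rnorm(100, sd = 1000))
unscaled <- summary(prcomp(x))$importance[2, 1]            # close to 1
scaled   <- summary(prcomp(x, scale. = TRUE))$importance[2, 1]
c(unscaled = unscaled, scaled = scaled)
```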
Now the first two principal components explain a reasonable 54% and 16% of the variability in the data, respectively. Inspection of the biplot (or, alternatively, of the rotation matrix) gives some further insight into the results.
The first component seems to capture aspects coarsely related to quality of life (for instance, health- and economy-related factors): the two education-related variables and the life expectancy variable are negatively correlated with the first component, whereas the adolescent birth rate and the maternal mortality ratio are positively correlated with it. This component alone explains a whopping 54% of the variability of the data!
The second component seems to capture gender equality related phenomena as it is mostly correlated with female labour participation ratio and percent representation of females in parliament. This is another 16% of the data variability explained!
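The biplot readings above can also be checked numerically from the rotation matrix of the `prcomp` object. As a sketch on stand-in data (R's built-in USArrests, since the course data is not reloaded here; with our data one would inspect `pca_human$rotation` the same way):

```r
# Sketch: inspecting PCA loadings numerically instead of reading the biplot.
# USArrests is a stand-in dataset; the idea carries over to pca_human.
pca_demo <- prcomp(scale(USArrests))

# Rows are variables, columns are principal components; the sign of a
# loading gives the direction of the variable's correlation with the
# component, the magnitude its weight.
round(pca_demo$rotation[, 1:2], 2)
```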
We are now done with this dataset, and we shall turn our attention to tea (yes, the drink) next. To that end, we load the results of a questionnaire in which 300 participants were asked all kinds of tea-related questions, and take a look at the data.
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# NB: View() does not play well with markdown documents so we have commented it out...
# View(tea)
#...we can, however, take a quick peek at the data in the context of this markdown
# document, for example like this:
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
Above we have printed out the structure of the data: the variable names and types, along with the first few values of each variable. In summary, there are 36 variables, all of type Factor with two to six levels, except for the integer-valued age. There are also, as expected, 300 rows in the data: one observation per participant.
Next we shall visualize the data by plotting the distributions of answers to a handful of interesting questions as well as the distribution of participant ages.
# Create a new smaller dataset of some interesting variables
tea_2 <- select(tea, one_of(c("breakfast", "dinner", "evening",
                              "Tea", "how", "sugar",
                              "where", "How", "friends")))
# Shorten some names for clarity
levels(tea_2$how) <- c("bag", "bag+unpck.", "unpck.")
levels(tea_2$where) <- c("store", "store+shop", "shop")
# Plot the variables as well as the distribution of participant ages
pivot_longer(tea_2, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar()
hist(tea$age, col = "white", xlab = "Age", ylab = "", main = "")
Let us then run a Multiple Correspondence Analysis (MCA) on the new data with only the chosen variables.
# MCA
mca <- MCA(tea_2, graph = FALSE)
# Percentage of explained variance by dimensions
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 20))
# Biplot
fviz_mca_biplot(mca, repel = TRUE, ggtheme = theme_minimal())
Above we have plotted a scree plot of the percentage of variance explained by each dimension, and a biplot summarizing the results of the MCA. Evidently, we have not been hugely efficient in summarizing the data with fewer dimensions, since the first two dimensions (of ten) explain only 12.29% and 12.01% of the total variance, respectively.
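Where do those percentages come from? MCA is essentially correspondence analysis of the indicator (dummy) matrix of the factors, and the per-dimension percentages are that analysis's eigenvalues scaled to sum to 100. A base-R sketch on toy factors (not the tea data, and not FactoMineR's implementation, though it follows the same standard construction):

```r
# Base-R sketch of MCA eigenvalues via SVD of the indicator matrix
# (toy data, not the tea survey).
set.seed(42)
toy <- data.frame(
  a = factor(sample(c("yes", "no"), 100, replace = TRUE)),
  b = factor(sample(c("bag", "loose", "both"), 100, replace = TRUE))
)

# Complete disjunctive (indicator) matrix: one dummy column per category
Z <- cbind(model.matrix(~ a - 1, toy), model.matrix(~ b - 1, toy))

# Correspondence analysis of Z: standardized residuals of the
# independence model, then SVD; squared singular values are the eigenvalues
P <- Z / sum(Z)
r <- rowSums(P); cc <- colSums(P)
S <- diag(1 / sqrt(r)) %*% (P - r %o% cc) %*% diag(1 / sqrt(cc))
eig <- svd(S)$d^2

# Percentages of inertia per dimension, as shown in the scree plot
pct <- 100 * eig / sum(eig)
round(pct[pct > 1e-10], 2)
```

With Q variables and J categories in total, the indicator matrix yields J - Q nontrivial dimensions (here 5 - 2 = 3), which is why the tea analysis above has ten dimensions for its nine variables.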
Accordingly, it is relatively difficult to tease out a clear interpretation of the first two dimensions from the biplot, apart from a few clear connections between the variables. For instance, people who prefer unpackaged tea also tend to favour tea shops over chain stores, which isn't a great surprise.